[internal-dns] register and publish ddmd in the switch zone#10381

Open
zeeshanlakhani wants to merge 1 commit into main from zl/ddm-internal-dns

Conversation

@zeeshanlakhani
Collaborator


DDMD has always run in the switch zone alongside Dendrite, MGS,
and MGD, but it was never registered in internal DNS, leaving no path for a
cross-host consumer to discover it. This adds `ServiceName::Ddm`,
plumbs `ddm_port` through the host-zone switch (RSS plan + reconfigurator
DNS execution), threads an `Overridables::ddm_ports` map for the
test suite, and lands a `DdmInstance` dropshot sim in test utils so
that the test harness registers a real DDM port in DNS the same way it does
for the other switch-zone services.

We also drop the duplicate DDMD_PORT const in `ddm-admin-client` in favor of
the canonical `omicron_common::address::DDMD_PORT`. Same-host
callers continue to use `Client::localhost()`.

This was extracted from the multicast PR (zl/multicast-mgd-ddm), where Nexus
becomes the first cross-host consumer to reach ddmd via DNS resolution.
Contributor

@jgallagher jgallagher left a comment


Thank you very much for splitting this out!

dendrite_port: u16,
mgs_port: u16,
mgd_port: u16,
ddm_port: u16,
Contributor


I think this function taking 3 u16s in a row was already sketchy - what do you think about changing this to take something like

struct HostSwitchZonePorts {
    dendrite: u16,
    mgs: u16,
    mgd: u16,
    ddm: u16,
}

instead of four separate arguments? That way we get named parameters for the call sites, and don't have to rely on them getting the order correct, and inside this function, we can destructure them:

let HostSwitchZonePorts { dendrite, ...all the rest ... } = ports;

so we also get a compile-time check that we update this function if we change the struct?

/// In-process stand-in for the `ddmd` (Delay Driven Multipath daemon)
/// admin API.
///
/// `ddmd` runs in sled global zones and switch zones in real deployments,
Contributor


Will the ddmd that runs in the global zones ever need to be in DNS too, or is that one only communicated with locally?

Collaborator Author


Currently no.

Sled global-zone ddmd is only ever accessed by its own host: DdmReconciler advertises that sled's own prefixes, and RSS / wicketd / installinator do bootstrap-time peer discovery via Client::localhost(). That's all per-host protocol semantics, not cross-host service discovery.

The new consumer for this work is the multicast RPW in Nexus, which talks to switch-zone ddmd cross-host to read DDM peer topology so it can map sled-id to switch port. That's the only cross-host case today, and the only reason the switch-zone ddmd needs DNS publication.

There's probably a future case where we'd want to centralize sled-ddmd state across the rack, but for another time.

I've updated the doc comment so the asymmetry between switch-zone (DNS-published) and sled-global-zone (local) access is made explicit.

/// internal DNS as `ServiceName::Ddm`.
///
/// This currently has no registered routes. Any integration needing
/// concrete endpoints (e.g., peer lists) must extend the `ApiDescription`.
Contributor


This seems a little fishy - I think the other test-utils variants of these services are running something approximating the "real" service (albeit a stub binary for non-illumos systems - it still handles all the normal API endpoints). If this doesn't serve the same API, how can tests use it?

Collaborator Author

@zeeshanlakhani zeeshanlakhani May 7, 2026


Yeah, agreed.

Context: this came out of the multicast work. The multicast RPW in Nexus needs to resolve ServiceName::Ddm cross-host to read DDM peer topology, so we needed nexus-test-utils to publish a real DDM port in DNS. #10346 has a richer dropshot sim with /peers, set_peers, and a PeerMap for the multicast tests to sim against. I trimmed this down to bind-only because no test in this PR exercises any DDM endpoints.

Why not just subprocess the real ddmd?

ddmd has no "admin-only" mode like mgd does (via --no-bgp-dispatcher). It always brings up the discovery/exchange/routing daemons, which need real network interfaces and illumos-only kernel facilities. So we can't tokio::process::Command::new("ddmd") the way MgdInstance does for mgd, which is probably why this wasn't done before.

The proper fix is upstream adding an "admin-only"-like mode to ddmd in maghemite as a separate PR, then bumping the pin here, switching DdmInstance to spawn a real ddmd via subprocess (matching MgdInstance's pattern), and then wiring ddmd into the xtask download scheme. So, let me hook up a maghemite piece to this.
